Abstract: Long Reed-Solomon (RS) codes are desirable for
digital communication and storage systems due to their improved 
error performance, but the high computational complexity of 
their decoders is a key obstacle to their adoption in practice. As 
discrete Fourier transforms (DFTs) can evaluate a polynomial 
at multiple points, efficient DFT algorithms are promising in 
reducing the computational complexities of syndrome based 
decoders for long RS codes. In this paper, we first propose 
partial composite cyclotomic Fourier transforms (CCFTs) and 
then devise syndrome based decoders for long RS codes over large 
finite fields based on partial CCFTs. The new decoders based 
on partial CCFTs achieve a significant saving of computational 
complexities for long RS codes. Since partial CCFTs have 
modular and regular structures, the new decoders are suitable for 
hardware implementations. To further verify and demonstrate 
the advantages of partial CCFTs, we implement in hardware 
the syndrome computation block for a (2720, 2550) shortened
RS code over GF($2^{12}$). In comparison to previous results based
on Horner's rule, our hardware implementation not only has a 
smaller gate count, but also achieves much higher throughputs. 

I. Introduction 

Since syndrome-based hard-decision decoders of Reed-Solomon
(RS) codes [1] have quadratic complexities in their
code lengths, RS codes of short and medium lengths have
widespread applications in modern digital communication and
storage systems. To meet ever higher demand on error
performance, long RS codes (or shortened RS codes [2], [3]) over
large finite fields have been considered in theoretical studies.
For example, RS codes with thousands of symbols over
GF($2^{12}$) are considered in optical communication systems
Q and magnetic recording systems Q, (5J to achieve low 
bit error rates. One of the key obstacles to the adoption of 
such long RS codes in practice is high complexity caused by 
their extreme code lengths as well as the large sizes of their 
underlying fields. 

Fast algorithms for discrete Fourier transforms (DFTs) over 
finite fields are promising techniques to overcome this obsta- 
cle. This is because all steps except the key equation solver in 
syndrome-based hard-decision RS decoders [1] — syndrome 
computation, Chien search, and error magnitude evaluation — 
are polynomial evaluations. Hence, they can be formulated as 
DFTs over finite fields. 

Recently, cyclotomic fast Fourier transforms (CFFTs) over
finite fields have been used to reduce the complexities of RS
decoders [6], [7]. The CFFTs proposed in [6], [8], [9] have low
multiplicative complexities, but they have very high additive
complexities. By using techniques such as the common
subexpression elimination (CSE) algorithm in [10], the additive
complexities of CFFTs can be significantly reduced, leading
to small overall computational complexities for DFTs with
lengths up to 1024 [10]. By treating syndrome computation,
Chien search, and error magnitude evaluation as partial CFFTs
or dual partial CFFTs, the overall computational complexities
of these steps can be significantly reduced for short and
medium RS codes [6], [7]. Unfortunately, this approach is
not feasible for long DFTs and hence long RS codes, because
the CSE algorithm itself has a prohibitively high
computational complexity when applied to long DFTs. Without
the CSE algorithm, the overall computational complexities
of CFFTs are higher than those of other approaches due to their
additive complexities.

In this paper, we devise reduced-complexity decoders for
long RS codes based on composite cyclotomic Fourier
transforms (CCFTs) [11]. CCFTs first decompose long DFTs with
composite lengths into short sub-DFTs via the prime-factor
algorithm [12] or the Cooley-Tukey algorithm [13], and then
implement the sub-DFTs with CFFTs. We remark that CFFTs
are special cases of CCFTs corresponding to trivial
decompositions. The decomposition leads to significantly reduced
additive complexities at the expense of multiplicative complexities,
resulting in lower overall computational complexities than
CFFTs for moderate to long DFTs in practice. Furthermore,
the decomposition also endows CCFTs with modular
structures, which are suitable for hardware implementations.

The main contributions of this paper are as follows: 

• We first propose partial CCFTs and then apply them to
implement syndrome computation, Chien search, and error
magnitude evaluation of RS decoders. Partial CCFTs
not only inherit the two advantages (lower additive
complexities and modular structures) of full CCFTs, but their
two-tier structure is also suitable for the implementation
of decoders for shortened RS codes. For instance, for
DFTs in shortened RS codes, certain time-domain elements
are zeros and certain frequency-domain components
are not needed. For partial CFFTs, either property
can lead to multiplicative complexity reduction but not
both at the same time. The two-tier structure of CCFTs,
however, enables us to take advantage of both properties
simultaneously to reduce the multiplicative complexity.
Consequently, our results show that partial CCFTs lead
to a significant saving of computational complexities for
long RS codes.

• To further verify and demonstrate the advantages of
partial CCFTs, we implement in hardware the syndrome
computation block for a (2720, 2550) shortened RS code
over GF($2^{12}$). In comparison to previous results based
on Horner's rule, our hardware implementation not only
has a smaller gate count, but also achieves much higher
throughputs.

The rest of this paper is organized as follows. We review
CFFTs and CCFTs in Sec. II. Sec. III first proposes partial
CCFTs and then presents RS decoders based on partial CCFTs. The
hardware implementation results are provided in Sec. IV.
Finally, our paper concludes in Sec. V.

II. Background 
A. CFFTs and CCFTs over Finite Fields 

Assuming that $\alpha \in \mathrm{GF}(2^m)$ is an element of order $n$, the
DFT of an $n$-dimensional vector $f = (f_0, f_1, \dots, f_{n-1})^T$
over GF($2^m$) is given by $F = (f(\alpha^0), f(\alpha^1), \dots, f(\alpha^{n-1}))^T$,
where $f(x) = \sum_{i=0}^{n-1} f_i x^i$. That is, DFTs can be viewed as
polynomial evaluations. The vector $f$ is said to be in the
time domain and $F$ in the frequency domain. Direct CFFTs
(DCFFTs) [8] formulate the DFTs as $F = A L f'$, where
$A$ is an $n \times n$ binary matrix, $L$ a block diagonal matrix
with each block cyclic, and $f'$ a permutation of $f$. Since
the multiplication between a cyclic matrix and a vector can
be done by efficient bilinear algorithms for cyclic convolution,
CFFTs can be computed by $F = A Q (c \cdot P f')$, where $Q$
and $P$ are binary matrices, $c$ is a pre-computed vector, and $\cdot$
denotes an entry-wise multiplication between two vectors. Two
variants of DCFFTs, referred to as inverse CFFTs (ICFFTs)
[6] and symmetric CFFTs (SCFFTs) [9], respectively, compute
the DFTs by $F = L^{-1} A^{-1} f'$ and $F = L^T A^T f$, respectively.
Since it has been shown that ICFFTs and SCFFTs are
equivalent [10], without loss of generality we consider only
DCFFTs and SCFFTs in this paper.
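
As a minimal sketch of the polynomial-evaluation view of the DFT described above, the following Python code builds GF($2^4$) and evaluates a length-15 vector at all powers of $\alpha$. The primitive polynomial $x^4 + x + 1$, the table names `EXP` and `LOG`, and the function names are assumptions made for illustration; this is a direct evaluation, not a CFFT factorization.

```python
# GF(2^4) arithmetic; the primitive polynomial x^4 + x + 1 is an assumption.
EXP = [1] * 30
for i in range(1, 30):
    v = EXP[i - 1] << 1
    EXP[i] = v ^ 0b10011 if v & 0b10000 else v
LOG = {EXP[i]: i for i in range(15)}


def gf_mul(a, b):
    """Multiply two field elements using log/antilog tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def poly_eval(f, x):
    """Evaluate f(x) = sum_i f[i] x^i by Horner's rule (addition is XOR)."""
    acc = 0
    for coeff in reversed(f):
        acc = gf_mul(acc, x) ^ coeff
    return acc


def dft(f):
    """15-point DFT over GF(2^4): F[j] = f(alpha^j) for j = 0..14."""
    assert len(f) == 15
    return [poly_eval(f, EXP[j]) for j in range(15)]
```

Because addition in GF($2^m$) is XOR, the transform is linear over XOR, which is the property the matrix formulations above rely on.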

The composite cyclotomic Fourier transform in [11] can
further reduce the overall computational complexity by
decomposing the long DFTs into short sub-DFTs via the
prime-factor algorithm [12] or the Cooley-Tukey algorithm [13]. The
decompositions of the DFTs reduce the additive computational
complexity directly. Moreover, because of the short length of
the sub-DFTs, sophisticated tools such as the CSE algorithm in
[10] can be readily used to reduce the additive complexities
of CCFTs. CCFTs also have a modular structure, which is
desirable in hardware implementation. The sub-DFTs can be
used as sub-modules, which can be reused to save chip area
or parallelized to increase the throughput.

B. Reed-Solomon Decoders based on CFFTs 

Henceforth in this paper, we focus on cyclic Reed-Solomon
(RS) codes, which can be decoded by the syndrome-based
decoders considered herein [1]. An $(n, k)$ cyclic RS code
over GF($2^m$) with $n \mid 2^m - 1$ and $n - k = 2t$ can correct
up to $t$ errors or $2t$ erasures. An $(n', k')$ shortened RS code
can be viewed as a sub-code of an $(n, k)$ RS code where the
symbols at positions $i \ge n'$ are always zero. For a received
vector $r = (r_0, r_1, \dots, r_{n-1})^T$, the syndrome-based errors-only
(errors-and-erasures, respectively) decoder of RS codes
in the time domain consists of the following three steps [1]:

1) Compute the $2t$ syndromes $S_j = \sum_{i=0}^{n-1} r_i \alpha^{ij}$ for
$0 \le j \le 2t - 1$, where $\alpha$ is a primitive $n$-th root of unity.

2) Compute the error (errata) locator polynomial $\Lambda(x)$
and error (errata) evaluator polynomial $\Omega(x)$ by the
Berlekamp-Massey algorithm (BMA) or the extended
Euclidean algorithm.

3) Find the error (errata) positions by the Chien search. That
is, the error positions are obtained by finding the roots of
$\Lambda(x)$. Find the error (errata) values by Forney's formula,
which evaluates $\Omega(x)$ and $\Lambda'(x)$ (the formal derivative of
$\Lambda(x)$) at the error (errata) positions.
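
Step 1 can be sketched directly from its definition. The Python code below is an illustrative assumption rather than the decoder of this paper: it uses GF($2^4$) built from $x^4 + x + 1$, a non-systematic encoder (message polynomial times generator polynomial), and hypothetical helper names. It checks that a codeword has all-zero syndromes and that a single error of value $v$ at position $p$ gives $S_j = v\,\alpha^{pj}$.

```python
# GF(2^4) arithmetic; the primitive polynomial x^4 + x + 1 is an assumption.
EXP = [1] * 30
for i in range(1, 30):
    v = EXP[i - 1] << 1
    EXP[i] = v ^ 0b10011 if v & 0b10000 else v
LOG = {EXP[i]: i for i in range(15)}


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def poly_mul(a, b):
    """Product of two polynomials given as low-order-first coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= gf_mul(ai, bj)
    return out


def poly_eval(f, x):
    acc = 0
    for coeff in reversed(f):
        acc = gf_mul(acc, x) ^ coeff
    return acc


def generator(two_t):
    """g(x) = prod_{i=0}^{2t-1} (x - alpha^i); minus equals plus in GF(2^m)."""
    g = [1]
    for i in range(two_t):
        g = poly_mul(g, [EXP[i], 1])
    return g


def syndromes(r, two_t):
    """S_j = sum_i r_i alpha^(i j) for 0 <= j <= 2t - 1."""
    return [poly_eval(r, EXP[j]) for j in range(two_t)]
```

Each syndrome here costs one Horner evaluation over the whole received vector, which is the baseline the DFT-based formulations aim to improve on.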

Since evaluating a polynomial at multiple points can be 
implemented as a DFT, DFTs can be used to reduce the 
computational complexity of steps 1 and 3. When DFTs are 
used to implement syndrome computation in the RS decoder, 
only 2t frequency-domain elements are needed. Hence, the 
unnecessary rows and columns of the matrices in DCFFTs 
or SCFFTs can be removed to reduce both multiplicative 
and additive complexities, resulting in partial DCFFTs and 
partial SCFFTs. Similarly, when DFTs are used to evaluate the 
error (errata) locator and evaluator polynomials, many time- 
domain elements are zeroes due to the limited degrees of both 
polynomials. Again the unnecessary rows and columns of the 
matrices in DCFFTs and SCFFTs can be removed, leading 
to dual partial DCFFTs and dual partial SCFFTs. Since a
shortened RS code is essentially an RS code with zero symbols,
these zero symbols are treated as zero time-domain elements.
When DFTs are used to implement syndrome computation, the 
Chien search, and Forney's formula, these DFTs are partial in 
both time and frequency domains. 

Although the complexity of the Berlekamp-Massey algo- 
rithm is important to efficient RS decoders, the implementation 
of the Berlekamp-Massey algorithm is not considered hence- 
forth in this paper, since the computational complexity of the 
Berlekamp-Massey algorithm cannot be reduced by DFTs. 

III. RS Decoders Based on Partial Composite 
Cyclotomic Fourier Transforms 

In this section, we first propose partial CCFTs and then
devise syndrome-based time-domain RS decoders based on
our partial CCFTs. The complexities of our RS decoders are
compared with those of previous works in the literature.

A. Partial Composite Cyclotomic Fourier Transforms 

When $N = N_1 N_2$, with the prime-factor algorithm [12]
or the Cooley-Tukey algorithm [13], an $N$-point CCFT can
be carried out in a two-tier structure. The first tier performs
$N_2$ $N_1$-point CFFTs and the second performs $N_1$ $N_2$-point
CFFTs. When the greatest common divisor of $N_1$ and $N_2$ is
greater than one, twiddle factors are needed. When $N_1$ and
$N_2$ are co-prime, no twiddle factor is required.
When $N_1$ or $N_2$ is composite, the $N_1$- or $N_2$-point DFTs can
be further decomposed, leading to a multi-tier structure. Fig. 1
shows the two-tier structure of a 3 × 5 CCFT, where the first
tier consists of five 3-point CFFTs and the second tier three
5-point CFFTs. This regular and modular structure is suitable
for hardware implementations, since it is much easier to apply
architectural techniques such as folding and pipelining to it,
leading to efficient hardware implementations.

Fig. 1. The regular and modular structure of our 15-point CCFT based on
a 3 × 5 decomposition.
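
The two-tier structure can be checked numerically for the 3 × 5 case. The sketch below is an assumption-laden illustration: it uses GF($2^4$) built from $x^4 + x + 1$, plain sub-DFTs in place of CFFTs, and one common prime-factor index mapping (input index $(N_2 a + N_1 b) \bmod N$, output index recovered from $k \bmod N_1$ and $k \bmod N_2$), which may differ from the wiring drawn in Fig. 1.

```python
# GF(2^4) arithmetic; the primitive polynomial x^4 + x + 1 is an assumption.
EXP = [1] * 30
for i in range(1, 30):
    v = EXP[i - 1] << 1
    EXP[i] = v ^ 0b10011 if v & 0b10000 else v
LOG = {EXP[i]: i for i in range(15)}


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def dft(f, w_log):
    """len(f)-point DFT with kernel w = alpha^w_log (w must have order len(f))."""
    out = []
    for k in range(len(f)):
        acc = 0
        for i, fi in enumerate(f):
            acc ^= gf_mul(fi, EXP[(w_log * i * k) % 15])
        out.append(acc)
    return out


def two_tier_dft(f, n1=3, n2=5):
    """15-point DFT computed as n2 n1-point DFTs followed by n1 n2-point DFTs."""
    n = n1 * n2
    assert n == 15 and len(f) == n
    # First tier: module b transforms inputs f[(n2*a + n1*b) mod n], a = 0..n1-1.
    tier1 = [dft([f[(n2 * a + n1 * b) % n] for a in range(n1)], n2)
             for b in range(n2)]
    # Second tier: module k1 transforms the k1-th outputs of all first-tier modules.
    tier2 = [dft([tier1[b][k1] for b in range(n2)], n1) for k1 in range(n1)]
    # Frequency index k is output (k mod n2) of second-tier module (k mod n1).
    return [tier2[k % n1][k % n2] for k in range(n)]
```

No twiddle factors appear because 3 and 5 are co-prime, in line with the remark above.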

When some frequency-domain components are not needed
or some of the time-domain elements are always zero, the
corresponding rows and columns of matrices in the sub-CFFTs
can be removed, resulting in partial CCFTs. As shown in [11],
CCFTs have lower computational complexities than CFFTs in
evaluating long DFTs, and hence we expect that partial CCFTs
have advantages in reducing the computational complexities of
decoders for long RS codes.

We remark that if we decompose an $N$-point DFT as $1 \times N$,
the corresponding partial CCFT reduces to a partial SCFFT,
and if we decompose the DFT as $N \times 1$, the corresponding
partial CCFT reduces to a partial DCFFT. Therefore, our
partial CCFTs include partial DCFFTs and partial SCFFTs
as special cases. In this sense, DFT decomposition
provides another degree of freedom to reduce the computational
complexities of DFTs. In the following, we focus on the
computational complexities of partial CCFTs with non-trivial
decompositions, i.e., decompositions other than $1 \times N$ and
$N \times 1$.

We now discuss how the complexity of partial CCFTs can be reduced
when only part of the time-domain or frequency-domain elements
are involved, and compare partial CCFTs with partial CFFTs.
Assuming a two-tier structure for simplicity, there are three
possible scenarios:

1) When limited frequency domain elements are needed.

For RS codes, when DFTs are used to compute the syndromes
of a received vector, only the first $2t$ frequency-domain
components are needed. The results in [7] show
that the multiplicative complexity of a partial SCFFT is
reduced greatly, but because the matrix $A$ is not sparse,
it is hard to reduce the multiplicative complexity of a
partial DCFFT. Even though partial DCFFTs have smaller
additive complexities than partial SCFFTs, they have
higher overall computational complexities. For partial
CCFTs, the multiplicative complexity of the second tier
can be directly reduced due to the unnecessary
frequency-domain components. However, computing even
one frequency-domain component of an $N_2$-point vector
requires all of its time-domain elements. Hence the outputs of
the DFTs in the first tier are unnecessary only in rare cases,
e.g., when the number of DFTs in the second tier exceeds
the number of necessary frequency-domain components, and
the complexity of the DFTs in the first tier cannot
be reduced in most cases. Thus, the complexity reduction
of partial CCFTs is not as great as that of partial CFFTs.

2) When some time domain elements are zero.

For RS codes, when DFTs are used to reduce the
computational complexities of Chien search and error
evaluation, only a few time-domain components are non-zero,
and hence partial DCFFTs can reduce the multiplicative
complexities greatly and have lower overall complexities.
For partial CCFTs, the multiplicative complexity of the
first tier can be directly reduced due to the zero
time-domain components, while the complexity of the second
tier cannot be easily reduced except in rare cases.

3) When limited frequency domain elements are needed 
and some time domain elements are zero. 

For shortened RS codes, only part of the time-domain 
elements are nonzero and only part of the frequency- 
domain components are needed. Neither partial DCFFTs 
nor partial SCFFTs can take full advantage of both prop- 
erties simultaneously. In contrast, the two-tier structure 
of partial CCFTs is advantageous. Due to the two-tier 
structure of CCFT, we can use DCFFTs in the first tier 
and SCFFTs in the second tier to reduce the multiplicative 
complexities as well as the overall complexities. 
Example 1: Consider a (15, 11) RS code over GF($2^4$) with
a generator polynomial $\prod_{i=0}^{3}(x - \alpha^i)$, where $\alpha$ is a root of
the primitive polynomial $x^4 + \cdots + 1$. This code can correct up
to two errors or four erasures, and hence we need to compute
the first four frequency-domain components in the DFT of a
received codeword as the syndrome. We can decompose the
15-point DFT as a 3 × 5 CCFT by the prime-factor algorithm as
shown in Fig. 1. The 3-point SCFFT in the first tier is given
by

[Equation (3): matrix form of the 3-point SCFFT; the matrix entries are not recoverable from the source text.]

and the 5-point DCFFT in the second tier is given by

[Equation: matrix form of the 5-point DCFFT; missing from the source text.]

Since we need to compute the first four frequency components,
from Fig. 1 we need the first and the fourth outputs from the
first 5-point DFT module, the second output from the second
one, and the third output from the third one. Then the 5-point
DFT modules can be simplified by removing the unnecessary
computations accordingly. For example, when we simplify the
first 5-point DFT module, we can remove the second, third,
and fifth rows in $AQ$, resulting in the fourth and fifth columns
containing only zeros. Then the corresponding rows in $c$ and $P$
can be removed, thus reducing the additive and multiplicative
complexities. This reduction procedure is similar to that of the
partial CFFT. However, the DFT modules in the first tier
cannot be simplified because all the outputs of these modules
are required for the computation in the second tier.

Example 2: Now let us consider a (15, 13) RS code, which
can correct one error or up to two erasures. Only the first
two frequency-domain components are needed and we still
decompose the 15-point DFT by the prime-factor algorithm
as a 3 × 5 CCFT. From Fig. 1, no output from the third
5-point DFT module is needed and hence it can be removed.
Therefore, the last output from each 3-point DFT module in
the first tier is not needed, and hence these modules can be simplified
by removing unnecessary computations accordingly. Only in
cases of this kind, i.e., when the number of required
frequency-domain components is less than the number of DFTs in the
second tier, can the computational complexity of the first tier
be reduced.
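
The output selection in Examples 1 and 2 can be reproduced with a small index computation. This sketch assumes the prime-factor output mapping in which frequency index $k$ is output $k \bmod N_2$ of second-tier module $k \bmod N_1$ (0-based); the function name is hypothetical, and the mapping is an assumption that happens to agree with the outputs listed in both examples.

```python
def needed_outputs(n1, n2, num_freq):
    """For frequency indices 0..num_freq-1, list the outputs needed per second-tier module."""
    need = {}
    for k in range(num_freq):
        need.setdefault(k % n1, []).append(k % n2)
    return need


def unused_modules(n1, n2, num_freq):
    """Second-tier modules none of whose outputs are required."""
    return sorted(set(range(n1)) - set(needed_outputs(n1, n2, num_freq)))
```

For the (15, 11) code, `needed_outputs(3, 5, 4)` selects outputs 0 and 3 of module 0, output 1 of module 1, and output 2 of module 2; for the (15, 13) code, `unused_modules(3, 5, 2)` reports that module 2 can be dropped.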

Example 3: Consider a (10, 6) RS code shortened from a
(15, 11) code. In the syndrome computation step, we still need
four frequency components, which implies that the 5-point DFTs
in the second tier can be simplified in the same way as in
Example 1. Moreover, as the inputs $f_{10}, f_{11}, \dots, f_{14}$ are zero,
the 3-point DFT modules in the first tier connected to these
inputs can be simplified accordingly.
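
To see which first-tier modules are touched by the zero inputs in Example 3, one can list the inputs of each 3-point module. The sketch assumes the prime-factor input mapping $(N_2 a + N_1 b) \bmod N$ for input $a$ of first-tier module $b$; this mapping and the function names are illustrative assumptions, and Fig. 1 may wire the inputs differently.

```python
def first_tier_inputs(n1, n2):
    """Time-domain indices feeding each of the n2 first-tier n1-point modules."""
    n = n1 * n2
    return [[(n2 * a + n1 * b) % n for a in range(n1)] for b in range(n2)]


def nonzero_inputs(n1, n2, n_short):
    """Inputs per first-tier module that can be nonzero for a code shortened to n_short."""
    return [[i for i in module if i < n_short]
            for module in first_tier_inputs(n1, n2)]
```

Under this mapping every 3-point module of the (10, 6) code keeps two of its three inputs, so each module is simplified a little rather than a few modules being removed outright.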

These examples are relatively small, and they do not have
smaller complexities than the corresponding partial CFFTs.
However, we can expect partial CCFTs to have
smaller computational complexities as the length of the RS code
increases.

B. Syndrome Computation 

For an $(n, k)$ RS code, the syndromes of a received vector
$r = (r_0, r_1, \dots, r_{n-1})^T$ are given by $S_j = \sum_{i=0}^{n-1} r_i \alpha^{ij}$ for
$0 \le j \le 2t - 1$, which are the first $2t$ frequency-domain
elements of the DFT of $r$ and can be computed with our partial
CCFT. For an $(n', k')$ RS code shortened from the $(n, k)$
code, we can still use the $n$-point partial CCFT to compute
the syndrome, provided that the time-domain elements of the
CCFT input with indexes $i \ge n'$ are set to zero. The partial
CCFT can then be simplified correspondingly by removing the
unnecessary computations.

Due to their widespread applications, we select the
(255, 223), (511, 447), and (1023, 895) RS codes over
GF($2^8$), GF($2^9$), and GF($2^{10}$), respectively, as examples to
show the computational complexity reduction by partial CCFTs.
We also select two shortened RS codes with parameters
(2720, 2550) [2] and (3073, 2731) over GF($2^{12}$) to
illustrate the advantage of the two-tier structure.

We compare the complexities of syndrome computation for
the five RS codes mentioned above based on partial CCFTs,
partial SCFFTs [7], the prime-factor algorithm [14], and Horner's rule
[1] in Tab. I. For partial CCFTs, we have tried all possible
decompositions of the DFT lengths, and only the non-trivial
decompositions with the smallest computational complexities
are listed in Tab. I. Note that due to the extreme code length,
the additive complexities of the syndrome computation for
the two shortened RS codes over GF($2^{12}$) based on partial
CFFTs are not optimized with the CSE algorithm in [10]. The
total complexity in Tab. I is defined to be a weighted sum
of the additive and multiplicative complexities. We assume
that one multiplication has the same complexity as $(2m - 1)$
additions over the same field. This assumption comes from
both hardware and software considerations [10]. In Tab. I,
the smallest total complexities for all the codes are in boldface.
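
The weighted sum can be written as total $= (2m - 1) \cdot$ (multiplications) $+$ (additions). The short Python sketch below, with a hypothetical function name, applies this definition to three entries of Tab. I.

```python
def total_complexity(m, mult, add):
    """Weighted total over GF(2^m): one multiplication counts as (2m - 1) additions."""
    return (2 * m - 1) * mult + add


# Partial CCFT, (255, 223) code over GF(2^8): 252 mult., 2652 add.
ccft_255 = total_complexity(8, 252, 2652)
# Partial SCFFT, (511, 447) code over GF(2^9): 345 mult., 16471 add.
scfft_511 = total_complexity(9, 345, 16471)
# Partial CCFT, (2720, 2550) code over GF(2^12): 7565 mult., 63869 add.
ccft_2720 = total_complexity(12, 7565, 63869)
```

These evaluate to 6432, 22336, and 237864, matching the corresponding Total columns of Tab. I.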

From Tab. I, we can see that both partial CCFTs and partial
SCFFTs have much smaller complexities than Horner's
rule, which is used widely in practice. In GF($2^8$), partial CCFTs
have a higher multiplicative complexity than partial SCFFTs.
However, due to the reduced additive complexities, partial
CCFTs have smaller overall computational
complexities in GF($2^m$) when $m = 9$ or 10, although the
improvement is marginal in GF($2^9$) and GF($2^{10}$), roughly 1%
and 4%, respectively. Due to the sub-optimality of the CFFT
and the efficiency of the CCFT for long DFTs, the savings will
be greater for larger fields. For the two shortened RS codes
over GF($2^{12}$), the total complexities based on partial CCFTs
are only a fraction of those based on partial CFFTs.

C. Chien Search and Error Magnitude Evaluation 

In RS decoders, the Chien search is used to determine the
error (errata) locations by finding the roots of the error (errata)
locator polynomial $\Lambda(x)$. It is implemented by evaluating $\Lambda(x)$
at all points $\alpha^i$ in the finite field GF($2^m$) with $0 \le i \le 2^m - 2$,
which can be done efficiently by fast DFT algorithms such as
the partial CCFTs in our paper. The input vector of the DFT
has at most $2t + 1$ nonzero elements. For shortened $(n', k')$ RS
codes, possible error (errata) locations must be less than $n'$.
Therefore, only the first $n'$ frequency-domain components are
needed, and hence the partial CCFT can be simplified accordingly.
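
A brute-force Chien search over GF($2^4$) illustrates the evaluation described above. The sketch assumes the common convention $\Lambda(x) = \prod_l (1 - X_l x)$ with $X_l = \alpha^{j_l}$ for an error at position $j_l$, so that the roots are $\alpha^{-j_l}$; this convention, the primitive polynomial $x^4 + x + 1$, and the helper names are assumptions not fixed by the text.

```python
# GF(2^4) arithmetic; the primitive polynomial x^4 + x + 1 is an assumption.
EXP = [1] * 30
for i in range(1, 30):
    v = EXP[i - 1] << 1
    EXP[i] = v ^ 0b10011 if v & 0b10000 else v
LOG = {EXP[i]: i for i in range(15)}


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= gf_mul(ai, bj)
    return out


def poly_eval(f, x):
    acc = 0
    for coeff in reversed(f):
        acc = gf_mul(acc, x) ^ coeff
    return acc


def chien_search(lam):
    """Evaluate Lambda at alpha^i, i = 0..14, and map each root alpha^i to position -i mod 15."""
    roots = [i for i in range(15) if poly_eval(lam, EXP[i]) == 0]
    return sorted((15 - i) % 15 for i in roots)
```

Evaluating $\Lambda(x)$ at all fifteen points is exactly a 15-point DFT of its coefficient vector padded with zeros, which is why a partial transform with few nonzero inputs applies.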

For the RS codes we study, Forney's formula [1] is given by
$Y_i = -\left.\frac{\Omega(x)}{x \Lambda'(x)}\right|_{x = \alpha^{-j}}$, where $Y_i$ is the error (errata) magnitude
of the $i$-th error (errata) located at position $j$, and $\Lambda'(x)$ is the
formal derivative of $\Lambda(x)$. Although we evaluate $\Omega(x)$ and
$\Lambda'(x)$ only at the points corresponding to the error locations,
the error locations vary from one received vector to
another. Therefore, we can evaluate $\Omega(x)$ and $\Lambda'(x)$ at all the
points in the finite field using partial CCFTs, and then select
the frequency-domain components corresponding to the error
locations.

Moreover, we can combine the computation of the Chien
search and Forney's formula by splitting the polynomial $\Lambda(x)$
into $\Lambda_e(x) + \Lambda_o(x)$, where $\Lambda_e(x)$ and $\Lambda_o(x)$ are the sums of
the terms in $\Lambda(x)$ with even and odd degrees, respectively. It is
easy to verify that in GF($2^m$), $x \Lambda'(x) = \Lambda_o(x)$. Hence we can
first evaluate the three polynomials $\Omega(x)$, $\Lambda_e(x)$, and $\Lambda_o(x)$ at
all points in the finite field by partial CCFTs, and then compute
$\Lambda(a)$ by $\Lambda_e(a) + \Lambda_o(a)$ for all $a \in \mathrm{GF}(2^m)$ with $n$ additional
TABLE I

Complexity comparison of the syndrome computation for errors-and-erasures RS decoders.

| Field | Code | $n_1 \times n_2$ | Partial CCFT Mult. | Partial CCFT Add. | Partial CCFT Total | Partial SCFFT [7] Mult. | Partial SCFFT [7] Add. | Partial SCFFT [7] Total | Prime-factor [14] Mult. | Prime-factor [14] Add. | Prime-factor [14] Total | Horner's rule [1] Mult. | Horner's rule [1] Add. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GF($2^8$) | (255, 223) | 3 × 85 | 252 | 2652 | 6432 | 149 | 3970 | 6205 | 852 | 1804 | 14584 | 7874 | 8128 |
| GF($2^9$) | (511, 447) | 7 × 73 | 873 | 7268 | 22109 | 345 | 16471 | 22336 | 5265 | 7309 | 35496 | 32130 | 32640 |
| GF($2^{10}$) | (1023, 895) | 31 × 33 | 2868 | 18569 | 73061 | 824 | 60471 | 76397 | 6785 | 15775 | 144690 | 129794 | 130816 |
| GF($2^{12}$) | (2720, 2550) | 63 × 65 | 7565 | 63869 | 237864 | 1467 | 1244779 | 1278520 | - | - | - | 459511 | 462230 |
| GF($2^{12}$) | (3073, 2731) | 63 × 65 | 9268 | 82684 | 295848 | 2782 | 2760210 | 2824196 | - | - | - | 1047552 | 1050624 |

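TABLE II

Complexity comparison of combined Chien search and Forney's formula for errors-and-erasures RS decoders.

| Field | Code | Polynomial | $n_1 \times n_2$ | Partial CCFT Mult. | Partial CCFT Add. | Partial CCFT Div. | Partial CCFT Total | Partial DCFFT Mult. | Partial DCFFT Add. | Partial DCFFT Div. | Partial DCFFT Total | Horner's rule [1] Mult. | Horner's rule [1] Add. | Horner's rule [1] Div. | Horner's rule [1] Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GF($2^8$) | (255, 223) | $\Omega(x)$ | 85 × 3 | 252 | 2764 | - | 6544 | 149 | 3226 | - | 5461 | 992 | 992 | - | 15872 |
| | | $\Lambda_e(x)$ | 85 × 3 | 177 | 1845 | - | 4500 | 78 | 1828 | - | 2998 | 4064 | 4080 | - | 65040 |
| | | $\Lambda_o(x)$ | 85 × 3 | 191 | 2230 | - | 5095 | 108 | 3096 | - | 4716 | 4064 | 3825 | - | 64785 |
| | | Misc | - | - | 255 | 32 | - | - | 255 | 32 | - | - | 255 | 32 | - |
| GF($2^9$) | (511, 447) | $\Omega(x)$ | 73 × 7 | 834 | 6013 | - | 20191 | 345 | 12791 | - | 18656 | 4032 | 4032 | - | 72576 |
| | | $\Lambda_e(x)$ | 73 × 7 | 658 | 4353 | - | 15539 | 177 | 7802 | - | 10811 | 16320 | 16352 | - | 293792 |
| | | $\Lambda_o(x)$ | 73 × 7 | 678 | 4684 | - | 16210 | 248 | 12533 | - | 16749 | 16320 | 15841 | - | 293281 |
| | | Misc | - | - | 511 | 64 | - | - | 511 | 64 | - | - | 511 | 64 | - |
| GF($2^{10}$) | (1023, 895) | $\Omega(x)$ | 33 × 31 | 2687 | 16743 | - | 67796 | 824 | 52557 | - | 68213 | 16256 | 16256 | - | 325120 |
| | | $\Lambda_e(x)$ | 33 × 31 | 2295 | 14718 | - | 58323 | 430 | 30294 | - | 38464 | 65408 | 65472 | - | 1308224 |
| | | $\Lambda_o(x)$ | 33 × 31 | 2291 | 14523 | - | 58052 | 541 | 51655 | - | 61934 | 65408 | 64449 | - | 1307201 |
| | | Misc | - | - | 1023 | 128 | - | - | 1023 | 128 | - | - | 1023 | 128 | - |
| GF($2^{12}$) | (2720, 2550) | $\Omega(x)$ | 65 × 63 | 7807 | 65253 | - | 244814 | 1542 | 1326289 | - | 1361755 | 28730 | 28730 | - | 689520 |
| | | $\Lambda_e(x)$ | 65 × 63 | 6889 | 57631 | - | 216078 | 787 | 691858 | - | 709959 | 231115 | 231200 | - | 5546845 |
| | | $\Lambda_o(x)$ | 65 × 63 | 6897 | 57095 | - | 215726 | 1082 | 1320622 | - | 1345508 | 231115 | 228480 | - | 5544125 |
| | | Misc | - | - | 2720 | 170 | - | - | 2720 | 170 | - | - | 2720 | 170 | - |
| GF($2^{12}$) | (3073, 2731) | $\Omega(x)$ | 65 × 63 | 9610 | 77852 | - | 298882 | 2908 | 2760306 | - | 2827190 | 116622 | 116622 | - | 2798928 |
| | | $\Lambda_e(x)$ | 65 × 63 | 8033 | 66641 | - | 251400 | 1550 | 1497544 | - | 1533194 | 525312 | 525483 | - | 12607659 |
| | | $\Lambda_o(x)$ | 65 × 63 | 8018 | 65968 | - | 250382 | 2041 | 2751557 | - | 2798500 | 525312 | 522410 | - | 12604586 |
| | | Misc | - | - | 3073 | 342 | - | - | 3073 | 342 | - | - | 3073 | 342 | - |

Total complexity per code:

| Field | Code | Partial CCFTs/CFFTs | Horner's rule [1] |
|---|---|---|---|
| GF($2^8$) | (255, 223) | 13175 + 32 divisions | 145952 + 32 divisions |
| GF($2^9$) | (511, 447) | 46188 + 64 divisions | 660160 + 64 divisions |
| GF($2^{10}$) | (1023, 895) | 165335 + 128 divisions | 2941568 + 128 divisions |
| GF($2^{12}$) | (2720, 2550) | 679338 + 170 divisions | 11780490 + 170 divisions |
| GF($2^{12}$) | (3073, 2731) | 803737 + 342 divisions | 28014246 + 342 divisions |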

additions. The error locations are the points where A(x) = 0. 
With Forney's formula, the error (errata) magnitudes can be 
computed with at most t divisions (2t divisions). 
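
The identity $x\Lambda'(x) = \Lambda_o(x)$ that underlies the combined computation follows in one line from the characteristic of the field. Writing $\Lambda(x) = \sum_i \lambda_i x^i$ (the coefficient name $\lambda_i$ is notation introduced here for illustration), and recalling that $i\lambda_i$ means $\lambda_i$ added to itself $i$ times, which vanishes for even $i$ in GF($2^m$):

```latex
\begin{align*}
x\,\Lambda'(x)
  &= x \sum_{i \ge 1} i\,\lambda_i\, x^{i-1}
   = \sum_{i \ge 1} (i \bmod 2)\,\lambda_i\, x^{i}
   = \sum_{i\ \mathrm{odd}} \lambda_i\, x^{i}
   = \Lambda_o(x).
\end{align*}
```

Evaluating $\Lambda_e(x)$ and $\Lambda_o(x)$ separately therefore yields both $\Lambda(x)$ for the Chien search and the denominator term needed by Forney's formula.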



In Tab. II, we compare the computational complexity of
combined Chien search and Forney's formula based on partial
CCFTs with non-trivial decompositions, partial DCFFTs, and
Horner's rule for the five RS codes and shortened RS codes
discussed in Sec. III-B. The choices of partial CCFTs and
CFFTs do not affect the number of divisions. Similar to
syndrome computation, the advantage of using partial CCFTs
(with non-trivial decompositions) instead of partial CFFTs is
rather limited for RS codes over GF($2^m$) when $m \le 10$.
However, the advantage of partial CCFTs is much greater
in larger fields. Again for the two shortened RS codes over
GF($2^{12}$), the total complexities based on partial CCFTs are
only a fraction of those based on partial CFFTs. Finally, since
partial CFFTs are special cases of partial CCFTs with trivial
decompositions, we can choose the most efficient algorithm
to evaluate $\Omega(x)$, $\Lambda_e(x)$, and $\Lambda_o(x)$, respectively. In Tab. II,
the total complexity of combined Chien search and Forney's
formula based on partial CCFTs/CFFTs is also provided.



IV. Hardware Implementations 

The additive and multiplicative complexities derived in
Sec. III consider only the total number of additions
and multiplications required by partial CCFTs. Although this
metric is a good estimate of the computational complexities,
it reflects only part of the hardware complexities. For example,
buffers, multiplexers, and control units are required if we want
to reuse modules to save chip area, and their complexities
need to be accounted for. Thus, in this section hardware
implementations are used to further verify and demonstrate
the advantages of partial CCFTs.

In the literature, numerous syndrome-based RS decoder
designs use Horner's rule [1] to implement the syndrome
computation, Chien search, and Forney's formula. Since we
want to replace Horner's rule by partial CCFTs, the syndrome
computation module is representative to illustrate the
advantages of partial CCFTs. Although the architecture
and hardware design of RS decoders are well studied in
the literature, there are few results on RS codes over
GF($2^{12}$) due to their extreme lengths. Therefore, in this
section, we choose to implement in hardware the syndrome
computation block for the (2720, 2550) shortened RS code
in [2] as an example, because detailed synthesis results of
the syndrome computation block are provided in [2]. Two
VLSI designs synthesized with 0.18 μm CMOS technology are
provided in [2] with different parallelization parameters. We
also implement this block with partial CCFTs, and synthesize
it with a more advanced 45 nm technology [15]. No hardware
implementation results are provided in [7]. Given the extreme
length of this code, since the CSE algorithm cannot be used to
reduce the additive complexities of partial CFFTs, partial CCFTs
have a significant advantage over partial CFFTs, as shown
in Tabs. I and II.

A. Hardware Implementations 

When we use partial CCFTs to compute the syndrome for 
the (2720, 2550) RS code, 2720 time-domain elements and 
170 frequency-domain components are needed in the 4095-point
DFT. If we implement this block in a fully parallel
fashion, the computational complexity in Tab. [I] is a good 
estimate of the hardware complexity. However, the hardware 
complexity is too large to be used in practice. Fortunately, 
the modular structure of partial CCFTs enables us to fold 
the architecture. Since the CCFTs decompose the long DFTs 
into several short sub-DFTs, those sub-DFTs can be used as 
modules in hardware implementations. They can be reused 
to save the chip area and power consumption, or pipelined 
and parallelized to increase the throughput. This is a desirable 
property in hardware implementation of the RS decoders. 

In our hardware implementation, we first decompose the
4095-point DFT as 63 × 65 as suggested by Tab. I, i.e., first
compute 65 63-point DFTs and then compute 63 65-point
DFTs. Computing these DFTs in one clock cycle in a fully
parallel way requires 65 63-point DFT modules and 63
65-point DFT modules. This straightforward implementation has
very high complexity. Instead, we carry out the partial CCFT
in two steps. The first step computes the 65 63-point DFTs in
$T_1$ clock cycles, each cycle computing at most $\lceil 65/T_1 \rceil$
63-point DFTs; and the second step computes the 63 65-point DFTs
in $T_2$ clock cycles, each cycle computing at most $\lceil 63/T_2 \rceil$
65-point DFTs. Therefore, we can compute the partial CCFT
in $T_1 + T_2$ cycles with $\lceil 65/T_1 \rceil$ 63-point DFT modules and
$\lceil 63/T_2 \rceil$ 65-point DFT modules. These 63-point DFT modules
and 65-point DFT modules are implemented by CFFTs to
reduce their complexities, and the computations involving the
zero time-domain inputs and/or unnecessary frequency-domain
components are removed.
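
The folding arithmetic above can be summarized in a few lines. This Python sketch, with a hypothetical function name and dictionary keys, computes the number of sub-DFT modules per tier and the cycle count for a given pair $(T_1, T_2)$, using only the ceiling expressions stated in the text.

```python
from math import ceil


def folded_schedule(t1, t2, n1=63, n2=65):
    """Module counts and cycle count for an n1 x n2 two-tier CCFT folded over (t1, t2) cycles."""
    return {
        "first_tier_modules": ceil(n2 / t1),   # n1-point DFT modules, n2 DFTs in t1 cycles
        "second_tier_modules": ceil(n1 / t2),  # n2-point DFT modules, n1 DFTs in t2 cycles
        "cycles": t1 + t2,
    }
```

For example, `folded_schedule(13, 9)` gives 5 63-point modules, 7 65-point modules, and 22 cycles, while `folded_schedule(5, 7)` gives 13 and 9 modules in 12 cycles.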

B. Implementation Results and Remarks 

We provide two hardware designs with $(T_1, T_2)$ equal to
(13, 9) and (5, 7), respectively. The synthesis results are shown
in Tab. III, and they are compared with the two designs with
different parallelization parameters in [2]. Due to the different
process technologies used in the synthesis, the clock rates
cannot be compared directly. We provide both clock rates
and throughputs of all implementations (the throughput
is defined as the number of vectors that can be processed
per second). The equivalent gate count is computed by
dividing the total chip area by the area of an XOR gate in
the corresponding technology, and it can serve as a metric to
compare designs in different process technologies.



TABLE III

Comparison of the VLSI implementations of the syndrome computation for the (2720, 2550) RS code.

| | Partial CCFT | Partial CCFT | Horner's rule [2] | Horner's rule [2] |
|---|---|---|---|---|
| $(T_1, T_2)$ | (5, 7) | (13, 9) | - | - |
| Process | 45 nm | 45 nm | 0.18 μm | 0.18 μm |
| Clock rate | 250 MHz | 200 MHz | 112 MHz | 225 MHz |
| Gate count | 384k | 306k | 920k | 480k |
| Required cycles | 12 | 22 | 86 | 171 |
| Throughput (vec/s) | 20.8M | 9.1M | 1.3M | 1.3M |


From Tab. III, we can see that both the gate count and the
required cycles are reduced greatly compared with the designs in
[2] because a partial CCFT has a much smaller computational
complexity than Horner's rule. With partial CCFTs, we can
design an RS decoder with smaller area and larger throughput
because of reduced gate counts and required numbers of
cycles, respectively.
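
Since the throughput is defined as vectors per second, it equals the clock rate divided by the required cycles. The sketch below (function name hypothetical) recomputes the throughput row of Tab. III from the clock-rate and cycle rows.

```python
def throughput(clock_hz, cycles):
    """Vectors processed per second when each vector needs `cycles` clock cycles."""
    return clock_hz / cycles


# (clock rate in Hz, required cycles) for the four designs in Tab. III
designs = {
    "ccft_5_7": (250e6, 12),
    "ccft_13_9": (200e6, 22),
    "horner_a": (112e6, 86),
    "horner_b": (225e6, 171),
}
rates_mvec = {name: round(throughput(*d) / 1e6, 1) for name, d in designs.items()}
```

The rounded values are 20.8, 9.1, 1.3, and 1.3 million vectors per second, consistent with the table.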

Due to the modular structure of partial CCFTs, we can 
make a wide range of trade-offs between the chip area and 
throughput. We can reduce the number of the required cycles 
by increasing the number of sub-DFT modules in each tier, and 
the chip area is therefore increased. For example, if we reduce
the required cycles from 22 to 12, the gate count increases
from 306k to 384k as shown in Tab. III. In contrast, it is
not easy for partial CFFTs to make such trade-offs because
of the irregular structure of the post-addition network for
partial CFFTs (see [7]). Moreover, since we compute the sub- 
DFTs by CFFTs, which are implemented as bilinear algorithms 
and also have modular structure, we can shorten the critical 
path and improve the clock rate by pipelining the sub-DFT 
modules, i.e., inserting pipeline registers between pre-addition 
network, multipliers, and post-addition network. 

We remark that we focus on the decomposition 63 × 65 for
the 4095-point DFT above. Other decompositions, including
multi-tier decompositions, can be considered. For example,
a decomposition 7 × 9 × 5 × 13 would lead to a four-tier
structure, which leads to a smaller critical path delay since
the sub-DFTs in each tier are smaller and they can be
pipelined.

V. Conclusion 

We extend our previous work in [11] by proposing partial
CCFTs to reduce the computational complexity of
syndrome-based RS decoders. Our results show that partial CCFTs have
advantages in reducing the computational complexity of the
DFTs, which can be used to implement the syndrome
computation, Chien search, and Forney's formula. The hardware
implementation results show that since the computational
complexity is reduced greatly, smaller chip area and fewer
clock cycles are needed to compute the syndrome of a
received vector. Moreover, the modular structure of partial
CCFTs provides a wide range of trade-offs between the chip
area and throughput, which is a favorable property in hardware
designs.